In this tutorial we will cover the basics of using Google Custom Search to search the Internet.
Links:
Video Tutorial:
In [1]:
import json
import requests
import pandas as pd
Google Custom Search replaces the deprecated Google Search API. It is designed to search one or more specific websites and to be embedded within a website.
There is still an option to search the entire web. This option, combined with not specifying any websites to search, returns results that are very close to what you get when you search Google directly. The remaining difference is due to the personalized and localized results that regular Google Search returns.
In [2]:
key = ""
cx = ""
Parameter name | Value | Description |
---|---|---|
Required parameters | | |
q | string | The search expression. |
Optional parameters | | |
c2coff | string | Enables or disables Simplified and Traditional Chinese Search. |
cr | string | Restricts search results to documents originating in a particular country. |
cref | string | The URL of a linked custom search engine specification to use for this request. |
cx | string | The custom search engine ID to use for this request. |
dateRestrict | string | Restricts results to URLs based on date. |
exactTerms | string | Identifies a phrase that all documents in the search results must contain. |
excludeTerms | string | Identifies a word or phrase that should not appear in any documents in the search results. |
fileType | string | Restricts results to files of a specified extension. A list of file types indexable by Google can be found in the Webmaster Tools Help Center. |
filter | string | Controls turning on or off the duplicate content filter. |
gl | string | Geolocation of end user. |
googlehost | string | The local Google domain (for example, google.com, google.de, or google.fr) to use to perform the search. |
highRange | string | Specifies the ending value for a search range (see lowRange). |
hl | string | Sets the user interface language. |
hq | string | Appends the specified query terms to the query, as if they were combined with a logical AND operator. |
imgColorType | string | Returns black and white, grayscale, or color images: mono, gray, and color. |
imgDominantColor | string | Returns images of a specific dominant color. |
imgSize | string | Returns images of a specified size. |
imgType | string | Returns images of a type. |
linkSite | string | Specifies that all search results should contain a link to a particular URL. |
lowRange | string | Specifies the starting value for a search range. Use lowRange and highRange to append an inclusive search range of lowRange...highRange to the query. |
lr | string | Restricts the search to documents written in a particular language (e.g., lr=lang_ja). |
num | unsigned integer | Number of search results to return. |
orTerms | string | Provides additional search terms to check for in a document, where each document in the search results must contain at least one of the additional search terms. |
relatedSite | string | Specifies that all search results should be pages that are related to the specified URL. |
rights | string | Filters based on licensing. Supported values include: cc_publicdomain, cc_attribute, cc_sharealike, cc_noncommercial, cc_nonderived, and combinations of these. |
safe | string | Search safety level. |
searchType | string | Specifies the search type: image. If unspecified, results are limited to webpages. |
siteSearch | string | Specifies all search results should be pages from a given site. |
siteSearchFilter | string | Controls whether to include or exclude results from the site named in the siteSearch parameter. |
sort | string | The sort expression to apply to the results. |
start | unsigned integer | The index of the first result to return. |
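As an illustration of how the optional parameters combine with the required ones, a request could limit the result count, restrict the date range, and target a single site. The values below are placeholders, not recommendations:
parameters = {"q": "halloween",
              "cx": cx,
              "key": key,
              "num": 5,                           # at most 10 results per request
              "dateRestrict": "m6",               # assumed format: results from the last 6 months
              "siteSearch": "en.wikipedia.org",   # illustrative site only
              }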
In [3]:
url = "https://www.googleapis.com/customsearch/v1"
parameters = {"q": "halloween",
"cx": cx,
"key": key,
}
In [4]:
page = requests.request("GET", url, params=parameters)
In [5]:
results = json.loads(page.text)
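Before exploring the response it is worth confirming that the request actually succeeded. A minimal check, assuming the API reports failures through the HTTP status code and an "error" key in the JSON body:
# A non-200 status (or an "error" key in the body) means the search failed,
# e.g. because of a bad key/cx or an exhausted quota.
if page.status_code != 200:
    print("Request failed with status", page.status_code)
    print(results.get("error", results))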
In [6]:
results.keys()
Out[6]:
In [7]:
results["kind"]
Out[7]:
In [8]:
results["url"]
Out[8]:
In [9]:
len(results["items"])
Out[9]:
In [10]:
results["queries"]
Out[10]:
In [11]:
results["searchInformation"]
Out[11]:
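"searchInformation" holds summary statistics for the query. A small illustrative snippet (the key names below are assumed from the standard response format):
total = results["searchInformation"].get("totalResults")
search_time = results["searchInformation"].get("searchTime")
print(f"About {total} results returned in {search_time} seconds")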
In [12]:
results["items"][0]
Out[12]:
In [13]:
def process_search(results):
    # Flatten the "items" in a search response into a DataFrame
    # with one row per result and link/title/snippet columns.
    link_list = [item["link"] for item in results["items"]]
    df = pd.DataFrame(link_list, columns=["link"])
    df["title"] = [item["title"] for item in results["items"]]
    df["snippet"] = [item["snippet"] for item in results["items"]]
    return df
df = process_search(results)
df
Out[13]:
Use "start"
parameter to skip results from previous pages. To get the next "start"
index look it up in "queries.nextPage[0].startIndex"
In [14]:
next_index = results["queries"]["nextPage"][0]["startIndex"]
search_terms = results["queries"]["nextPage"][0]["searchTerms"]
url = "https://www.googleapis.com/customsearch/v1"
parameters = {"q": search_terms,
"cx": cx,
"key": key,
"start": next_index
}
In [15]:
page = requests.request("GET", url, params=parameters)
results = json.loads(page.text)
In [16]:
temp_df = process_search(results)
df = pd.concat([df, temp_df], ignore_index=True)
df
Out[16]:
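Putting the pieces together, below is a sketch of a small pagination helper. The function name search_pages and the three-page default are illustrative, and the loop assumes the API stops offering a nextPage once its usual limit of roughly 100 results per query is reached:
def search_pages(query, pages=3):
    """Fetch several result pages for one query and return them as a single DataFrame."""
    url = "https://www.googleapis.com/customsearch/v1"
    parameters = {"q": query, "cx": cx, "key": key}
    frames = []
    for _ in range(pages):
        page = requests.request("GET", url, params=parameters)
        results = json.loads(page.text)
        if "items" not in results:      # no results returned (or an error/quota problem)
            break
        frames.append(process_search(results))
        next_page = results["queries"].get("nextPage")
        if not next_page:               # no further pages are offered
            break
        parameters["start"] = next_page[0]["startIndex"]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

df_all = search_pages("halloween", pages=3)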